A Method to Find Community Structures Based on Information Centrality 
Community structures are an important feature of many social, biological
and technological networks. Here we study a variation on the method for
detecting such communities proposed by Girvan and Newman, which is based
on the idea of using centrality measures to define the community boundaries
(M. Girvan and M. E. J. Newman, Community structure in social and biological
networks, Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002)). We develop a
hierarchical clustering algorithm that consists in iteratively finding and
removing the edge with the highest information centrality. We test the
algorithm on computer-generated and real-world networks whose community
structure is already known or has been studied by means of other methods.
We show that our algorithm, although it runs to completion in a time
O(n^4), is very effective, especially when the communities are very mixed
and hardly detectable by the other methods.



I. INTRODUCTION 

Network analysis has proved to be a powerful approach
to understanding complex phenomena and organization in
social, biological and technological systems [1, 2, 3, 4, 5].
In the framework of network analysis a given system is 
modeled as a graph in which the nodes are the elements 
of the system, for instance the individuals in a social 
system, the neurons in a brain and the routers in the 
Internet, and the edges represent the interactions, social 
links, synapses and electric wirings respectively, between 
pairs of elements. Much interest has focused on
the characterization of various structural and locational
properties of the network. Among others,
an important property common to many networks is
the presence of subgroups or community structures.
For instance, in social networks some individuals can be 
part of a tightly connected group or of a closed social 
elite, others can be completely isolated, while some oth- 
ers may act as bridges between groups. The differences 
in the way that individuals are embedded in the struc- 
ture of groups within the network can have important 
consequences on the behavior they are likely to practice. 
The division of the individuals of a social network into 
communities is a fundamental aspect of a social system. 
In fact, subgroups in social systems often have their own 
norms, orientations and subcultures, sometimes running 
counter to the official culture, and are the most important
source of a person's identity. For this reason one
of the main concerns, since the very beginning of social
network analysis, has been the definition and the identification
of subgroups of individuals within a network.
Indeed, the first algorithms to find community structures
were proposed in social network analysis.
Subgroups are also important in other networks. The
presence of subgrouping in biological and technological
networks may hide important information on the functioning
of the system, and can be relevant to understanding
the growth mechanisms of such networks. In fact, communities
in the World Wide Web may represent pages on
common topics, while communities in cellular and genetic
networks might represent functional modules.
For this reason, the techniques to find the substructures
within a network provide a powerful tool for understanding
the structure and the functioning of the network.
In this paper we present a new method to discover community
structures that uses the recently introduced information
centrality measure, based on the concept of
network global efficiency. The information centrality
is used here to quantify the relevance of each of
the edges in the network. The method consists in finding
and removing the edges with the highest centrality score
until the network breaks up into components.

The paper is organized as follows. In Section II we
review the definitions of cliques and cohesive subgroups
and the standard methods for finding community structures
in networks. In Section III we propose the new
method and describe its implementation. In Section IV
we discuss the application of the algorithm to
computer-generated networks for which the existing subgroups
are known and controlled. We show
that the algorithm, although slower than the best available
methods, can be extremely effective at discovering
community structures, especially when the communities
are very mixed and hardly detectable. Finally in Section V
we discuss a number of applications to real-world
networks. In Section VI we present our conclusions.



II. DEFINITION OF COHESIVE SUBGROUPS 

Social analysts were the first to formalize the idea of 
communities and to devise mathematical measures of the 
number and cohesion of communities. Here we review 
the most important definitions developed for social sys- 
tems. For this reason the discussion of this section will 
be mainly in terms of social networks, although, as we
will see in the following sections, the idea of community
structure applies equally well to other networks. A community,
or cluster, or cohesive subgroup is a subset of individuals
among whom there are relatively strong, direct,
intense ties. The starting point of all the definitions and
measures is the concept of subgraph. A subgraph is any 
collection of nodes selected from the nodes of the whole 
graph, together with the edges connecting those nodes. 
A random sample of points in a graph representing a so- 
cial system is for example a subgraph but it is not likely 
to correspond to any meaningful social group. The no- 
tion of a meaningful social group is based on the property 
of cohesion among the various members of the subgraph. 
However the cohesion of a subgraph can be quantified 
by using various different properties of the ties among 
subsets of nodes. The choice of a particular property in- 
stead of another depends on the researcher's decision that 
a particular mathematical criterion can be given a mean- 
ingful and useful sociological interpretation. The general 
aim is to define a meaningful social category by investi- 
gating the structural properties of the whole graph and 
finding the naturally existing communities into which the 
social network can be divided. 

The literature on cohesive subgroups contains various 
ways to conceptualize the idea of subgroups in social 
networks. In particular, there are four main ideas that
take into account four different structural properties.
The resulting four categories of cohesive subgroups are 
sorted in such a way that going from the first to the last 
one we weaken the properties that the subgroups have 
to fulfill. We briefly present these ideas for one-mode, 
non-directed, non-valued graphs. 

1) The mutuality of ties. Cohesive subgroups based 
on the mutuality of ties require that all pairs of subgroup 
members choose each other. This idea is formalized in 
the definition of cliques. A clique is a maximal complete
subgraph of three or more nodes, i.e. a subset of nodes
all of which are adjacent to each other and such that no
other node is adjacent to all the members of the
clique.

2) The closeness or reachability of the members 
of the subgroup. Since the definition of clique is rather 
strong and restrictive for real social networks, a num- 
ber of extensions of the basic idea have been proposed. 
Cohesive subgroups based on reachability require that 
all the members are reachable from each other. The
n-cliques extend the notion of cliques, weakening the
requirement of adjacency among all the subgroup members.
An n-clique is a maximal subgraph in which the largest
geodesic distance between any two nodes is no greater
than n. When n = 1 we go back to the concept of clique.
2-cliques are subgraphs in which nodes need not
be adjacent but are reachable through at most one
intermediary. In 3-cliques all nodes are reachable through at
most two intermediaries, and so on.

A definition that will be important in the rest of the
paper is that of a component. A component is a maximal
connected subgraph, i.e. a subgraph in which there is a
path between all pairs of nodes, while there is no path
between a node in the subgraph and any node not in the
subgraph.



3) The frequency of ties among members. This 
idea of cohesive subgroups is based on restrictions on the 
minimum number of actors adjacent to each other in a 
subgroup. Whereas the concept of n-clique involves in- 
creasing the permissible path lengths, an alternative way 
to relax the strong assumption of cliques involves reduc- 
ing the number of other nodes to which each node must 
be connected. A k-plex is a maximal subgraph containing
n nodes in which each node is adjacent to no fewer
than n - k nodes in the subgraph. Compared to n-clique
analysis, k-plex analysis tends to find a relatively large 
number of smaller groups. 

4) The relative frequency of ties among subgroup
members compared to non-members. This idea of cohesive
subgroups is different from the previous three because
it is based on the comparison of ties within the
subgroup to ties outside the subgroup. In this way
cohesive subgroups are seen as areas of relatively high
density in the graph, parts that are locally denser than
the field as a whole. The LS set is the simplest formal
definition of a subgroup in this class. An LS set is a set of
nodes S such that any of its proper subsets (i.e. any possible
subset of nodes that can be selected from the nodes
in S) has more ties to its complement within S than to
the outside of S. The fact that LS sets are related
by containment implies that there is a hierarchy of LS
sets in a graph. The definition of lambda sets extends
that of LS sets, and is based on the concept of edge
connectivity. The edge connectivity of a pair of nodes i and
j is equal to the minimum number of edges that must
be removed from the graph in order to leave no path
between the two nodes. A set of nodes S is a lambda set if
any pair of nodes in S has larger edge connectivity than
any pair of nodes consisting of one node within S and a
node outside S. Lambda sets are based on the idea
that a cohesive subgroup is relatively robust, namely it
is hard to disconnect by the removal of edges. An alternative
approach based on the same idea is to consider
whether there are edges in the graph which, if removed, would
result in a disconnected structure. This approach is easy
to turn into an algorithmic procedure and allows
one to develop hierarchical clustering methods. Such methods
rank the edges of the network in terms of their importance
and remove them accordingly, where the edge importance can be
defined in different ways, as will be clear in a moment.
By doing this repeatedly the network breaks iteratively
into smaller and smaller components until it breaks into
a collection of single non-connected nodes. The resulting
hierarchical structure of clusters can be represented by
dendrograms, or hierarchical trees, such as the one reported in
Fig. 1, showing the clusters produced at each step of the
subdivision.

Recently, Girvan and Newman have considered two forms
of edge betweenness to measure the edge importance:
the shortest path betweenness and the random-walk
betweenness. The edge shortest path betweenness
extends to the edges the node betweenness proposed
by Freeman as a centrality measure for the nodes,
and is defined as the number of shortest paths between
pairs of nodes that run through that edge. The
random-walk betweenness instead considers random walks
connecting all pairs of nodes rather than the shortest
paths (random walks have also been used to quantify the
similarities-dissimilarities between nearest-neighbouring
nodes in other algorithms for finding communities).
The algorithms by Girvan and Newman at each step identify
and remove the edges that lie most "between" pairs
of nodes, in the sense that they are responsible for
connecting many pairs of nodes. The method for finding
community structures that we present in this paper is a
modification of the method by Girvan and Newman. In
our method we propose to identify directly the edges that,
when removed, most disrupt the network's ability to
exchange information among the nodes. In fact, instead
of the edge betweenness, we adopt a measure of centrality,
the information centrality, based on the
concept of efficient propagation of information over the
network. The information centrality has proved to be
an interesting quantity for characterizing the centrality
of the nodes of a network, and gives different results from
the betweenness centrality [9]. For this reason we think
that it might be useful to develop a hierarchical clustering
algorithm based on the edge information centrality.

After having described the formal definitions of cohesive
subgroups based on the relative frequency of ties, we need
to give some methods for assessing the cohesiveness of
the subgroups. This is especially important in hierarchical
clustering methods where one obtains a hierarchy
of community structures, from the original graph to the
extreme case in which all the nodes are disconnected: in
this case the number of communities depends on the level
at which the graph is partitioned, and we therefore need
a criterion to say at which point to stop. One of the first
measures of how cohesive a subgroup is was proposed in
Ref. [22] and is just the ratio of the number of ties (or
the average strength of ties for a valued graph) within
a subgroup divided by the number of ties from the subgroup
to nodes outside the subgroup. This measure was
recently extended by the measure of modularity,
which we will discuss in Section IV and which has proved
successful in expressing the degree of cohesiveness of
the communities of many networks. This is why it was
recently proposed in Ref. [23] to adopt the modularity
itself as the quantity to maximize in order to identify the best
community structure. The numerical implementation of
this maximization makes it possible to analyze very large networks
because it can be performed in a time which is far
shorter than the time required by all the previous
algorithms.



III. OUR METHOD FOR FINDING 
COMMUNITIES 

The algorithm for finding structures we propose here
makes use of a recently introduced centrality measure
that is based on the concept of efficient propagation
of information over the network. We assume
that the network we want to analyze can be represented
as a connected, non-directed, non-valued graph G
of N nodes and K edges. However, the extension to
non-symmetric and valued data does not present any special
problem and will be considered in a forthcoming paper.
The graph G is described by the adjacency matrix
a, an N x N matrix whose entry $a_{ij}$ is equal to 1 if i and
j are adjacent and 0 otherwise. Two nodes in the graph
are said to be adjacent if they are connected by an edge. The
entries on the main diagonal are undefined, and for convenience
they are set to be equal to 0. We now give some
definitions that will be useful in the following. A walk is
an alternating sequence of nodes and edges, where each
edge is linked to both the preceding and the succeeding
node. A path linking two nodes i and j is a walk from i
to j in which all points and edges are distinct: the length
of the path is the number of edges traversed to get from
i to j. The shortest path, or geodesic, between i and j is
any path from i to j containing the minimum number of
edges.

In order to describe how efficiently the nodes of the network
G exchange information we use the network efficiency
E, a measure introduced in earlier work. Such
a variable is based on the assumption that the
information/communication in a network travels along the shortest
paths (geodesics), and that the efficiency $\epsilon_{ij}$ in the
communication between two nodes i and j is equal to the
inverse of the shortest path length $d_{ij}$. The efficiency of
G is the average of $\epsilon_{ij}$:

$$E[G] = \frac{\sum_{i \neq j \in G} \epsilon_{ij}}{N(N-1)} = \frac{1}{N(N-1)} \sum_{i \neq j \in G} \frac{1}{d_{ij}} \qquad (1)$$

and measures the mean flow-rate of information over G.
The quantity E[G] varies in the range [0, 1], and is perfectly
defined also in the case of non-connected graphs.
In fact, when there is no path between i and j, we assume
$d_{ij} = +\infty$ and consistently $\epsilon_{ij} = 0$. Such a property will
be extremely important for our algorithm.
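
As an illustration of Eq. (1), the following is a minimal sketch, not the authors' implementation. It assumes the graph is stored as a dictionary mapping each node to the set of its neighbours (a representation chosen here for convenience), and the function names are hypothetical. Pairs with no connecting path are simply never reached by the search, so they contribute zero, as prescribed above.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def global_efficiency(adj):
    """E[G] of Eq. (1): average of 1/d_ij over ordered pairs i != j."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for i in adj:
        for j, d in bfs_distances(adj, i).items():
            if j != i:
                total += 1.0 / d  # unreachable pairs never appear: 1/inf = 0
    return total / (n * (n - 1))


# Hypothetical toy graphs (adjacency sets).
path = {0: {1}, 1: {0, 2}, 2: {1}}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(global_efficiency(path), global_efficiency(triangle))
```

For the three-node path the pairs at distance 1 contribute 1 and the pair at distance 2 contributes 1/2, giving E = 5/6; the complete triangle reaches the maximum E = 1.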
A measure of node centrality, the so called information
centrality, based on the network efficiency, has been
recently proposed. The same measure can be used to
quantify the importance of groups and classes.

Here we use such a measure to quantify the importance
of an edge of the graph G. The information centrality $C^I_k$
of the edge k is defined as the relative drop in the network
efficiency caused by the removal of the edge from G:

$$C^I_k = \frac{\Delta E}{E} = \frac{E[G] - E[G'_k]}{E[G]}, \qquad k = 1, \ldots, K \qquad (2)$$

Here by $G'_k$ we indicate a graph with N points and K - 1
edges obtained by removing the edge k from G. Notice
that this measure is perfectly defined also when $G'_k$ is a
non-connected graph.
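
Eq. (2) can be evaluated by brute force: remove each edge in turn, recompute the efficiency, and restore the edge. The sketch below does exactly that under the same assumptions as before (adjacency sets, hypothetical function names, node labels that can be compared with `<`). It is meant as an illustration of the definition, not as an optimized routine.

```python
from collections import deque


def efficiency(adj):
    """Global efficiency of Eq. (1) via one BFS per node."""
    n = len(adj)
    total = 0.0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                    total += 1.0 / dist[v]
    return total / (n * (n - 1)) if n > 1 else 0.0


def edge_information_centrality(adj):
    """Return {(u, v): C_k} with C_k = (E[G] - E[G'_k]) / E[G], Eq. (2)."""
    e0 = efficiency(adj)
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    scores = {}
    for u, v in edges:
        adj[u].discard(v)      # build G'_k by removing edge k
        adj[v].discard(u)
        scores[(u, v)] = (e0 - efficiency(adj)) / e0
        adj[u].add(v)          # restore G
        adj[v].add(u)
    return scores


# Hypothetical toy graph: two triangles joined by the bridge (2, 3).
barbell = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = edge_information_centrality(barbell)
print(max(scores, key=scores.get))
```

On this toy graph the bridge between the two triangles receives the highest score, because its removal makes every path between the two groups infinite.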

The method for finding the hierarchy of cohesive sub- 
groups in G consists in the iterative removal of the edges 
with the highest information centrality, until the system 
breaks up into components. We expect that the edges 
that lie between communities are those with the high- 
est information centrality, while those inside communi- 
ties have a low information centrality. The general form 
of the algorithm is the following: 

1. Calculate the information centrality score for each 
of the edges. 

2. Remove the edge with the highest score. 

3. Perform an analysis of the network's components. 

4. Go back to point 1 until all the edges are removed
and the system breaks up into N non-connected
nodes.
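
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the graph is a dictionary of adjacency sets with comparable node labels, the function names are hypothetical, and ties between equally scored edges are broken by taking the first one encountered (the text does not specify a tie rule).

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def efficiency(adj):
    """Global efficiency E[G]; unreachable pairs contribute zero."""
    n = len(adj)
    total = sum(1.0 / d
                for i in adj
                for j, d in bfs_distances(adj, i).items() if j != i)
    return total / (n * (n - 1)) if n > 1 else 0.0


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s not in seen:
            comp = set(bfs_distances(adj, s))
            seen |= comp
            comps.append(comp)
    return comps


def remove_by_information_centrality(adj):
    """Run steps 1-4; return [(removed_edge, components_after), ...]."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    history = []
    while any(adj.values()):  # step 4: repeat while edges remain
        e0 = efficiency(adj)
        best_edge, best_score = None, float("-inf")
        # Step 1: score each remaining edge by the relative efficiency drop.
        for u, v in [(u, v) for u in adj for v in adj[u] if u < v]:
            adj[u].discard(v)
            adj[v].discard(u)
            score = (e0 - efficiency(adj)) / e0
            adj[u].add(v)
            adj[v].add(u)
            if score > best_score:
                best_edge, best_score = (u, v), score
        # Step 2: remove the top-scoring edge.
        u, v = best_edge
        adj[u].discard(v)
        adj[v].discard(u)
        # Step 3: analyse the components of the reduced graph.
        history.append((best_edge, components(adj)))
    return history


# Hypothetical toy input: two triangles joined by the bridge (2, 3).
barbell = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
history = remove_by_information_centrality(barbell)
first_edge, first_split = history[0]
print(first_edge, sorted(sorted(c) for c in first_split))
```

The returned history contains one entry per removed edge, so the dendrogram can be read off by recording the steps at which the number of components increases.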

As in the Girvan and Newman algorithms, the recalculation
of the information centrality scores every time
after an edge has been removed appears to be an important
aspect of the algorithm. We will discuss this point
later. The calculation of all the shortest paths, necessary
to compute the efficiency of the network, can be
performed with a breadth-first search algorithm in time
O(KN). Then the calculation of the information
centrality for all the edges takes a time $O(K^2 N)$.
This time is comparable to the time it takes to compute
the random-walk betweenness for all the edges, but
is longer than the time O(KN) it takes to calculate the
shortest path betweenness for all the edges used in the
method of Girvan and Newman. The algorithm repeats the
calculation of all the information centralities for each edge
removed, i.e. K times. In conclusion, the entire community
structure algorithm based on the information centrality
can be completed in time $O(K^3 N)$, or time $O(N^4)$ for
a sparse graph. Although, as we will show in Section
IV, the algorithm can in some cases be better at finding
community structures than the algorithm based on
shortest path betweenness, because of its poor performance it
can be used only for graphs with up to about a thousand
nodes. For extremely large networks the best algorithm
to use is the one proposed in Ref. [23], based
on the maximization of the modularity, which runs in time
O(KN) or $O(N^2)$ on a sparse graph, or the one
based on the notion of voltage drops across
the network, which runs in time O(K + N).
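
The running-time estimate for the information centrality algorithm can be summarized in one line, multiplying the cost of one efficiency evaluation by the number of edges scored per step and by the number of removal steps (a sparse graph is taken here to mean that K grows proportionally to N):

$$\underbrace{O(KN)}_{\text{one evaluation of } E} \times \underbrace{K}_{\text{edges scored per step}} \times \underbrace{K}_{\text{removal steps}} = O(K^3 N) \;\longrightarrow\; O(N^4) \quad \text{for } K \sim N$$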



IV. TESTING THE METHOD ON COMPUTER 
GENERATED NETWORKS 

FIG. 1: Dendrogram of the communities found by applying
our algorithm to a computer generated random graph with
64 vertices and 256 edges. The random graph has been obtained
by dividing the nodes into 4 groups of 16 nodes each
(respectively empty circles, full circles, triangles and squares)
and considering $z_{in} = 6$, $z_{out} = 2$ (see text). In the top panel
the value of Q corresponding to the various divisions of the
dendrogram is reported.

We first applied our algorithm to computer generated
networks, i.e. random graphs constructed in such a way
that they have a well defined community structure. All
graphs have the same number of nodes, 128, and the
same number of edges, 1024. The nodes are divided into
four classes, which are the groups 1-32, 33-64, 65-96 and
97-128. We fixed to 16 the average number of edges per
node, and we label the edges according to whether they
connect members of the same group or not. The mixing
between the classes is introduced by tuning the average
number of edges connecting nodes belonging to different
classes. From a generic vertex of the graph we have
on average $z_{in}$ edges which join it to other vertices of
its group and $z_{out}$ edges connecting it to vertices of the
other groups. The two numbers are not independent, as
we must of course have $z_{in} + z_{out} = 16$. We remark that
this is the same set of graphs that Newman and,
previously, Girvan and Newman have used to test
their algorithms. In this way we are able to compare
directly the role of edge betweenness and edge information
centrality in determining the community structure.
As a practical example we show in Fig. 1 the dendrogram
corresponding to the analysis with our method of
a graph of this type, where for illustration purposes we
take a smaller network with 64 nodes and 8 edges per
node. Here, $z_{in} = 6$ and $z_{out} = 8 - z_{in} = 2$, i.e. the
network is strongly clustered. The algorithm produces a
hierarchy of subdivisions of the network: from a single
component to N isolated nodes, going from top to bottom
in the dendrogram (left to right in the figure). To
know which of the divisions is the best one for a given
network, i.e. where we have to cut the hierarchical tree,
we need to use a measure of the cohesiveness of the
communities. The first measure of how cohesive a subgroup
is was proposed in Ref. [22]. If there are N nodes in
the graph G and $N_s$ nodes in the subgroup S, the
cohesiveness of subgroup S can be defined as the ratio of the
number of ties within subgroup S divided by the number
of ties from S to outsiders:

$$\frac{\sum_{i \in S} \sum_{j \in S} a_{ij} \, / \, [N_s (N_s - 1)]}{\sum_{i \in S} \sum_{j \notin S} a_{ij} \, / \, [N_s (N - N_s)]} \qquad (3)$$

This measure was recently extended by Girvan and Newman
into the measure of modularity, which
allows one to consider more than one group at the same time
and tells us how good a subdivision of G into n subgroups
is.
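
The test graphs described above (four groups, with $z_{in}$ and $z_{out}$ edges per node on average) can be approximated with a simple generator. The sketch below is an assumption-laden illustration: it links each pair of nodes independently with a probability chosen so that the expected numbers of internal and external edges per node are $z_{in}$ and $z_{out}$, which fixes the edge count only on average rather than exactly at 1024. The function name, parameters and seed handling are hypothetical.

```python
import random


def planted_partition(n_groups=4, group_size=32, z_in=12.0, z_out=4.0, seed=0):
    """Random graph with n_groups equal groups; returns (adj, group labels).

    Assumption: independent edge probabilities, so z_in and z_out hold
    on average only.
    """
    rng = random.Random(seed)
    n = n_groups * group_size
    p_in = z_in / (group_size - 1)    # a node has group_size - 1 possible internal partners
    p_out = z_out / (n - group_size)  # and n - group_size possible external partners
    group = [i // group_size for i in range(n)]
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if group[i] == group[j] else p_out
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj, group


adj, group = planted_partition(z_in=12.0, z_out=4.0, seed=1)
mean_degree = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
print(len(adj), round(mean_degree, 2))
```

With the default arguments the graph has 128 nodes and a mean degree close to 16; passing `n_groups=4, group_size=16, z_in=6, z_out=2` gives a graph of the smaller type used for the dendrogram example.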




FIG. 2: Information centrality of the edge removed, global
efficiency E, number of components n and modularity Q for
the resulting graph as a function of the number of edges
removed.

The modularity Q is defined in the following way. Let
us suppose that we want to test the goodness of a subdivision
of the network into n well defined communities. We
expect that a good split is obtained if most of the edges
fall inside the communities, with comparatively few edges
joining the communities to each other. For this purpose
one introduces an n x n symmetric matrix e whose element
$e_{ij}$ is the fraction of all edges in the network that
link vertices in community i to vertices in community j
[28]. The trace of this matrix, $\mathrm{Tr}\, e = \sum_i e_{ii}$, gives the
fraction of edges in the network that connect vertices in
the same community. Simply maximizing the value
of the trace does not help, because by considering the
whole network as a single community we would get the
maximal value 1 without doing any subdivision at all.
Therefore we further define the row (or column) sums
$a_i = \sum_j e_{ij}$, which represent the fraction of edges that
connect to vertices in community i. If the network is
such that the probability of having an edge between two
sites is the same regardless of whether they belong to
the same community (random network), we would have
$e_{ij} = a_i a_j$. The modularity is defined as

$$Q = \sum_i (e_{ii} - a_i^2) = \mathrm{Tr}\, e - \|e^2\| \qquad (4)$$

where $\|e^2\|$ indicates the sum of the elements of the matrix
$e^2$. This quantity then measures the degree of correlation
between the probability of having an edge joining
two sites and the fact that the sites belong to the same
community. It now makes sense to look for high values
of Q. In fact, if we take the whole network as a single
community, we get Q = 0, and we can easily get higher
values by choosing subdivisions into more than just a single
class. Values approaching Q = 1, which is the maximum,
indicate strong community structure; on the other hand,
for a random network Q = 0. The expression in Eq. (4) is not
normalized, so that Q will not reach a value of 1, even on
a perfectly mixed network. For networks with an appreciable
subdivision into classes, Q usually falls in the range
from about 0.2 to 0.7.
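
The modularity of Eq. (4) is straightforward to compute once a partition is given. The following sketch assumes an adjacency-set representation and a partition supplied as a list of node sets covering all nodes; both the representation and the function name are illustrative choices, not part of the original method. Each edge is visited from both of its ends, so the fractions are taken with respect to 2K.

```python
def modularity(adj, communities):
    """Q = sum_i (e_ii - a_i^2), Eq. (4), for a partition given as node sets."""
    label = {v: c for c, comm in enumerate(communities) for v in comm}
    two_k = sum(len(nbrs) for nbrs in adj.values())  # each edge counted twice
    if two_k == 0:
        return 0.0
    e_diag = [0.0] * len(communities)  # e_ii: fraction of edges inside community i
    a = [0.0] * len(communities)       # a_i: fraction of edge ends in community i
    for u in adj:
        a[label[u]] += len(adj[u]) / two_k
        for v in adj[u]:
            if label[u] == label[v]:
                e_diag[label[u]] += 1.0 / two_k
    return sum(e - x * x for e, x in zip(e_diag, a))


# Hypothetical toy graph: two triangles joined by the bridge (2, 3).
barbell = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
q_split = modularity(barbell, [{0, 1, 2}, {3, 4, 5}])
q_whole = modularity(barbell, [set(barbell)])
print(q_split, q_whole)
```

For this toy graph, with K = 7 edges, each triangle has $e_{ii} = 3/7$ and $a_i = 1/2$, so the two-community split gives Q = 2 (3/7 - 1/4) = 5/14, while the trivial single-community partition gives Q = 0.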

In Fig. 1 we plot the Q corresponding to the classes we
determined after each split. The x-coordinate represents
the number of steps of the algorithm which end with a
split of the network (or of one of its components, if the
network is not connected). We remark that, since Q is
always calculated by using the full network, Q can only
vary if, after the removal of one edge, the number of
components of the network changes, otherwise it keeps
the value corresponding to the last subdivision. Taking
the number of removed edges as the x-coordinate would
result in a plot with many intervals where Q stays constant
and, even if that would not affect our description,
we do not consider it appropriate for a presentation.

The plot presents a single peak, which exactly cor- 
responds to the splitting of the network into the four 
groups. This means that the algorithm succeeds in iden- 
tifying the four classes. The height of the peak is 0.499, 
which indicates that the network is indeed highly clus- 
tered. 
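
As a rough consistency check on this peak height, assume that every vertex has exactly $z_{in} = 6$ internal and $z_{out} = 2$ external edges (an idealization, since the generated graphs satisfy this only on average). Eq. (4) for the partition into the four groups then gives

$$e_{ii} = \frac{1}{4}\,\frac{z_{in}}{z_{in}+z_{out}} = \frac{1}{4}\cdot\frac{6}{8} = 0.1875, \qquad a_i = \frac{1}{4}, \qquad Q = \sum_{i=1}^{4} (e_{ii} - a_i^2) = 4\,(0.1875 - 0.0625) = 0.5$$

which is close to the measured value of 0.499.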

In Fig. 2 we show the details of the calculation. We plot
the information centrality $C^I$ of the edge removed, the
global efficiency E, the number of components n of the
resulting graph and the value of Q as a function of the
number of removed edges, i.e. as a function of the iterations
of the algorithm. Each time we remove an edge
with a high information centrality score, i.e. each time
there is a sharp drop in the network efficiency, we also
observe a sharp increase in the modularity. The height of
the three main peaks in $C^I$ is roughly proportional to the
corresponding variations of Q. The correlation between
$C^I$ and Q is non-trivial, but we can give the following
simple argument to explain it. Suppose that after the
removal of an edge we get a split of the component A into
two classes, say $A_1$ and $A_2$. We indicate with $l_{A_1}$, $l_{A_2}$,
$l_A$ the number of edges joining pairs of vertices within
$A_1$, $A_2$ and A, respectively. Furthermore, let us denote
with $m_{A_1}$, $m_{A_2}$, $m_A$ the sum of the vertex degrees of all
the vertices of $A_1$, $A_2$ and A. According to Eq. (4) the
modularity $Q_b$ before the split is

$$Q_b = \frac{l_A}{K} - \left(\frac{m_A}{2K}\right)^2 \qquad (5)$$

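where K is the total number of edges of the network.
Notice that $l_A/K$ is exactly $e_{AA}$ of Eq. (4) and $m_A/2K$
is roughly $a_A$ (with i = A). On the other hand, after the
split, we get the modularity

$$Q_a = \frac{l_{A_1} + l_{A_2}}{K} - \left(\frac{m_{A_1}}{2K}\right)^2 - \left(\frac{m_{A_2}}{2K}\right)^2 \qquad (6)$$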

As just a few edges keep $A_1$ and $A_2$ together in A, $m_A$
is approximately given by $m_{A_1} + m_{A_2}$. So, we come to
the following expression for the modularity variation $\Delta Q$
after the split:

$$\Delta Q = Q_a - Q_b = \frac{l_{A_1} + l_{A_2} - l_A}{K} + \frac{m_{A_1} m_{A_2}}{2K^2} \qquad (7)$$

The first term on the r.h.s. of Eq. (7) is small, because
$l_A \approx l_{A_1} + l_{A_2}$, so the dominant term is the second one,
which is proportional to the product $m_{A_1} m_{A_2}$. On sparse
graphs like those we are dealing with here, $m_{A_1} m_{A_2}$ is
roughly proportional to the number of vertex pairs with
a vertex in $A_1$ and the other in $A_2$. This number of pairs
equals the number of paths going from $A_1$ to $A_2$, which
after the split are of infinite length and give a vanishing
contribution to the global efficiency of the network.
The variation of the information centrality is then due
to those paths, so it is proportional to $\Delta Q$, as we find
numerically.

Our aim is of course to test how the algorithm works
for many different networks, and this is accomplished by
considering many different realizations of the same graph
and checking how many vertices are correctly classified
in each case. We analyzed our artificial networks for various
values of $z_{out}$, ranging from 4 to 7.5, with a step of
0.25. We did not do a quantitative analysis of the interval
$0 < z_{out} < 4$ because there the algorithm always
finds the right classes (more than 99% of successful
attempts). For each value of $z_{out}$ we produced from 100
to 500 samples, and calculated the average fraction of
nodes which end up in their natural group. We plot such
averages in Fig. 3 as a function of $z_{out}$. In the same
plot we report the results obtained by using the algorithm
of Girvan and Newman on the same network. We
see that in the sector [4, 6] the two algorithms perform
equally well; the algorithm of Girvan and Newman seems
to lead in some cases to slightly better results but they
are compatible with ours within errors, except possibly
for $z_{out} = 5.75$. This is also the region of values of $z_{out}$
which corresponds to networks with a clear community
structure. In the sector [6, 7.5], where the communities
are very mixed and hardly detectable, both algorithms
inevitably start to fail, but our algorithm clearly performs
better. In [7, 7.5] our results are even better than
the ones obtained through the modularity-based algorithm
recently proposed by Newman [23]. These results
may justify the extra price in terms of CPU time that we
have to pay if we choose to adopt the algorithm based
on the information centrality. As far as the modularity
is concerned, we passed from peak values of about 0.65
for the lowest $z_{out}$ we have taken (2) to about 0.25 for
the most mixed cases ($z_{out} = 7.5$).

FIG. 3: Average fraction of correctly identified vertices as a
function of $z_{out}$. Each point represents an average over 100
to 500 graphs. The comparison with the analogous results
of Girvan and Newman shows that our algorithm performs
better when the communities are very mixed and hardly
detectable.

As further evidence of the similarities and differences
between edge information and betweenness centrality we
report in Fig. 4 a scatter plot of the two measures for
each of the 1024 edges of the initial network, i.e. before
we start the first iteration of the edge removal process.
The figure shows that, as expected, the two measures
are correlated, although there are some important differences.
In particular we notice that the edges with the
highest information centrality are not always those with the
highest betweenness. This is more evident when the communities
are mixed and hardly detectable. For instance in the
case $z_{out} = 7$ the edge with the largest information centrality, i.e.
the one that will be removed by our algorithm, is not the
one with the largest betweenness.

FIG. 5: The karate club network of Zachary (figure taken
from Girvan and Newman [18]).



FIG. 4: Correlation between edge information centrality and
betweenness centrality. Each point of the scatter plot refers
to an edge of an artificially generated network with 128 nodes
and 1024 edges. We consider the two values $z_{out} = 4$ and
$z_{out} = 7$, respectively representing a case in which the
communities are clearly separated and a case in which the
communities are mixed and hardly detectable.



V. APPLICATIONS TO REAL NETWORKS 

After the first experiments on artificial networks, we
can say that the algorithm seems promising. However,
if our method is any good, it must work as well for real
networks, which are the systems we are
most interested in. We present here the analysis of four
networks, although we analyzed more. The first three
of them, i.e. Zachary's karate club, the network of
American college football teams and the food web
of the Chesapeake Bay, have also been studied by other
authors with other hierarchical clustering methods. In
this way we can better understand the differences
between the various approaches. The last network
studied represents the interactions amongst a group of
20 monkeys.



A. Zachary's karate club 

The first example we considered is the famous karate
club network analyzed by Zachary. It consists of
34 persons (78 edges) whose mutual friendship relationships
have been carefully investigated over a period of
two years. Due to a conflict between a teacher and the
administrator of the club, the club split into two smaller
ones. The questions we want to answer are the following:
Is it possible, by studying the community structure
of the network before the split, to predict the behavior
of the network and in particular to identify the two
classes? Moreover, according to the network structure,
will a possible conflict most likely involve two factions or
multiple groups? The network is presented in Fig. 5,
where the squares and the circles label the members of
the two groups. The results of our analysis are illustrated
in the dendrogram of Fig. 6.

The first edge to be removed is the one linking node
12 to the rest of the network, i.e. the edge between
node 12 and node 1. This edge has
the largest information centrality (0.024) and a medium
value of betweenness (66), as shown in the scatter plot
reported in Fig. 7. Notice also that the edge with the
highest betweenness (142.79) is the edge connecting node
1 with node 32. The removal of the first edge thus leads
to the isolation of node 12. This is a feature that we
encountered several times in our analyses. The early
separation of a single node or of a small group is due to the
fact that a system often loses more efficiency because of
such splits than through the removal of inter-community
edges.

To see why this is so, let us consider the simple example
of Fig. 8, describing a network G with N nodes composed
of two cohesive subgroups, namely $G_1$ with $N_1$ nodes
and $G_2$ with $N_2$ nodes ($N_1 \sim N_2 \gg 1$), and of the
two nodes k, which is joined to the network via a single
edge (like node 12 in the karate club), and i, bridging
$G_1$ to $G_2$. In such a case the separation of the node
k leads to a decrease of efficiency proportional to the
number of remaining nodes, i.e. $\Delta E_{k\text{-split}} \propto O(N)$.
In fact, because k hangs from a single edge, the shortest paths
between pairs of nodes different from k are not affected
by the removal of that edge, so the only contributions
come from the paths from k to the rest of the network,
which are $N - 1$. On the other hand, the removal of the
edge linking i to $G_1$ influences the lengths of $N_1 \times N_2$
shortest paths, so that $\Delta E_{\text{int-comm}} \propto O(N^2)$. In such
a case, the edge standing between the two communities
$G_1$ and $G_2$ will be the first one to be removed. But
this is not always the case, since a simple modification
of the network considered in the figure would lead to a
different result. In fact, if we now suppose that node i is
connected to $G_1$ through two edges (as for the connection
between node i and $G_2$) instead of a single one, then the
algorithm will see the graph composed of $G_1$, $G_2$ and
i as a more cohesive structure than before and the first
edge to be removed will be the one connecting k to $G_1$.

Going back to the dendrogram of Fig. 6, we see that after
node 12 is removed from the network of the karate club,
the loosely bound node 27 (just two edges) also becomes
isolated from the rest. The third split finally separates the two
big groups. At this stage we have four components: two
isolated nodes (12 and 27) and two larger groups which
are homogeneous except for node 10, which is misclassified
(curiously enough, this node is also misclassified by the
fast algorithm of Newman [23]). The separation of the
four above-mentioned clusters corresponds to a peak in
the plot of Q. However, there is a second, higher peak
which is obtained for a split of the network into seven
communities. This double peak structure is present as
well in the Q-plot of the Girvan-Newman analysis [18].

FIG. 6: Dendrogram of the communities of the karate club.
Initially one has the split of two loosely bound nodes, 12 and
27, from the rest of the network. After that the two communities,
with the exception of node 10 (and of the two above-mentioned
nodes), are correctly identified. The separation of
the two communities corresponds to a peak in the modularity.

FIG. 7: Correlation between edge information centrality and
betweenness centrality for the karate club network. Each
point of the scatter plot refers to an edge of the network.

FIG. 8: A graph G composed of a node k and two cohesive
subgroups, $G_1$ and $G_2$, connected by node i.
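
The scaling argument for the graph of Fig. 8 can be checked numerically. The sketch below is only an illustration under stated assumptions: `toy_graph` is a hypothetical stand-in for Fig. 8 (two cliques of m nodes, a bridging node i with two edges to one clique, and a leaf node k), and the efficiency is taken to be the average of the inverse shortest-path lengths over all ordered pairs of nodes, with disconnected pairs contributing zero.

```python
from collections import deque


def efficiency(adj):
    """Global efficiency: average of 1/d_ij over ordered pairs i != j,
    with unreachable pairs contributing zero."""
    n = len(adj)
    total = 0.0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))


def efficiency_drop(adj, u, v):
    """E[G] - E[G minus edge (u, v)]; the edge is put back afterwards."""
    before = efficiency(adj)
    adj[u].remove(v)
    adj[v].remove(u)
    after = efficiency(adj)
    adj[u].add(v)
    adj[v].add(u)
    return before - after


def toy_graph(m, links_i_to_g1):
    """Hypothetical stand-in for Fig. 8 (needs m >= 3): G1 = nodes 0..m-1 and
    G2 = nodes m..2m-1 are cliques, node i = 2m bridges them, k = 2m+1 is a leaf."""
    i, k = 2 * m, 2 * m + 1
    adj = {node: set() for node in range(2 * m + 2)}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for offset in (0, m):  # build the two cliques
        for a in range(offset, offset + m):
            for b in range(a + 1, offset + m):
                link(a, b)
    for a in range(links_i_to_g1):  # i is tied to G1 by one or two edges
        link(i, a)
    link(i, m)  # i is tied to G2 by two edges
    link(i, m + 1)
    link(k, m - 1)  # k hangs from G1 by a single edge
    return adj


def edge_with_largest_drop(adj):
    """Edge (u, v), u < v, whose removal lowers the efficiency the most."""
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    return max(edges, key=lambda e: efficiency_drop(adj, *e))
```

With m = 6, `edge_with_largest_drop(toy_graph(6, 1))` picks the single edge between i and the first clique, while `edge_with_largest_drop(toy_graph(6, 2))` picks the edge attaching the leaf k, in line with the two cases discussed in the text.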

As for the computer generated networks, we report in
Fig. 9 the information centrality of the edge removed,
the global efficiency E, the number of components n of
the resulting graph and the value of Q as a function of
the number of edges removed from the network of the
karate club. The figure is analogous to Fig. 2. We observe
again a correlation between the peaks of the information
centrality $C^I$ and the jumps of Q (here we have two).
Moreover, as in the previous case, the absolute maximum
of Q corresponds to the lower of the two peaks of $C^I$.
We recall that the variation of the efficiency corresponding
to the removal of one edge is calculated by taking
into account the structure of the network at the current
stage, i.e. without considering the edges which were
eliminated in the previous steps. For the algorithm of Girvan
and Newman this condition of recalculation turns out
to be crucial, because removing the edges according to
the (decreasing) values of the betweenness as calculated
from the original configuration of the network leads to
very poor results. We wanted to check whether this is
also true for our method. Indeed, Fig. 10 clearly shows
that this is the case: the dendrogram does not reveal the
real splitting of the network into the two classes, which
instead look quite mixed up, and the modularity, whose
values are quite low throughout, presents a rather flat profile.

FIG. 9: Information centrality of the edge removed, global
efficiency E, number of components n and value of Q for the
resulting graph as a function of the number of edges removed
for the karate club network.



FIG. 10: Dendrogram of the communities of the karate club 
obtained by our method if we calculate the information cen- 
trality according to the initial structure of the network. This 
version of the algorithm fails to detect the communities. 
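
The modularity Q plotted against the number of removed edges can be evaluated for any candidate partition. The sketch below assumes the standard Newman-Girvan form, Q = sum over communities c of (e_cc - a_c^2), where e_cc is the fraction of the original edges lying inside c and a_c is the fraction of edge ends attached to c. The function name and input format are illustrative choices, not taken from the paper.

```python
def modularity(edges, communities):
    """Q = sum_c (e_cc - a_c**2), computed on the ORIGINAL edge list.

    edges       -- list of (u, v) pairs of the full, unpruned network
    communities -- list of disjoint node sets covering every node in `edges`
                   (e.g. the connected components after some edge removals)
    """
    label = {node: c for c, members in enumerate(communities) for node in members}
    m = len(edges)
    inside = [0] * len(communities)  # edges with both ends in community c
    ends = [0] * len(communities)    # edge ends attached to community c
    for u, v in edges:
        ends[label[u]] += 1
        ends[label[v]] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    return sum(inside[c] / m - (ends[c] / (2 * m)) ** 2
               for c in range(len(communities)))
```

For two triangles joined by one edge, splitting along that edge gives Q = 2 (3/7 - 1/4) = 5/14, whereas keeping all nodes in a single community gives Q = 0.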



B. Network of the American college football teams 

The second network we have investigated is the college
football network, representing the schedule of games
between American college football teams in a season.
The teams are divided into well known "conferences",
which are the communities, with a higher number of
games between members of the same conference than
between teams of different conferences. There are
altogether eleven conferences plus a few other teams which
do not belong to any conference. Fig. 11 shows the
dendrogram we have derived with our method. The pattern
of the modularity looks similar to the one we have shown
for the karate club, and it again presents two peaks,
the higher of which reaches the value Q = 0.485. The
corresponding subdivision of the network is the one we
highlighted in the figure. We identify ten groups which
coincide with ten conferences (either exactly or up to
one team). The teams labeled as Sunbelt are not recognized
as belonging to the same group. This group is
misclassified as well in the analysis of Girvan and Newman.

and the spectrum of the trophic levels; the latter allowed
five clusters of taxa to be identified in the cited work.
Nevertheless, our study did not reveal any particular
subdivision of the species. Repeating the analysis with the
algorithm of Girvan and Newman led essentially to the same results.
We encountered similar problems when analyzing other food webs;
the reason may be that these networks often
contain many edges, and our algorithm is probably not
suitable for the analysis of dense graphs.

FIG. 13: Dendrogram of the primate network. The circles
represent the asocial monkeys, the squares the social monkeys
(see text). There is no separation into classes; our procedure
leads to a progressive isolation of the nodes. The modularity
Q is very low; the higher peak corresponds to a partition into a
large group and the isolated nodes 5, 8 and 9 (besides the
asocial primates).



FIG. 12: Dendrogram of the communities of the Chesapeake 
Bay food web. The modularity peaks for the highlighted 
partition of the network. The two largest clusters are quite 
homogeneous, reflecting approximately the division between 
pelagic and benthic organisms. 



D. Primate Network 

In this section we consider a data set collected by Linda
Wolfe, recording 3 months of interactions amongst
a group of 20 monkeys, where interactions were defined
as the joint presence at the river. The dataset also contains
information on the sex and the age of each animal.



Monkeys 1-5 are males, monkeys 6-20 are females. In 
increasing order of age: monkeys 7, 14, 18, 20 belong 
to the first age group (the youngest), monkeys 4, 5, 9, 
10, 15, 17 to the second, monkeys 2, 3, 8, 12, 16 to the 
third and monkeys 1, 6, 11, 13, 19 to the fourth and 
oldest group. A detailed analysis of the individual and 
group centrality of this network can be found in Ref.
[31]. The total number of links is 31, i.e. of the same
order of magnitude as the number of nodes. Indeed, six out of twenty
monkeys did not actively participate in the social life of
the group; the resulting non-directed non-valued graph
thus consists of 6 isolated points (labeled by the numbers
2, 6, 16, 18, 19, 20) and a connected component of
14 points. The results of our analysis are illustrated in
Fig. 13, where we also report for each primate both
sex (M=male, F=female) and age (in years). The modularity
of the successive subdivisions of the network into
components is very low, which shows that there is no
appreciable community structure; nevertheless, two peaks
are clearly visible, the higher of which is obtained when
the nodes 5, 8 and 9 separate one after the other from
the network. One then gets a major community of eleven
elements and nine isolated monkeys. We do not find any
meaningful relationship between our partition and the
division of the primates into age groups. We analyzed the
network as well with the method of Girvan and Newman
and the results are essentially the same: one again gets
two peaks for the modularity (whose values remain low)
and the best partition of the network corresponds to a
separation into the same large community we found before
without node 11, which is now isolated, plus isolated sites
except the pair 5-8.

VI. CONCLUSIONS 

We have presented a new algorithm to identify the
subdivisions of complex networks into cohesive groups of
vertices, or communities. The algorithm is based on a
recently introduced centrality measure, the so-called
information centrality, and consists in ranking all edges
according to the value of this measure, so as to determine
which edge is most central: the latter edge is then
removed from the network. One then recalculates the
information centrality of the remaining edges and again
removes the most central edge; the procedure is repeated
until all edges are removed. The hope is that this sequential
removal of edges loosens the bonds between tightly
connected groups of vertices, so that, at some stage, they
eventually separate from each other.
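
The procedure just summarized can be sketched in a few lines. This is a minimal illustration rather than the implementation used for the results above: it assumes an unweighted, undirected graph given as an edge list with mutually comparable node labels, takes the information centrality of an edge to be the relative drop in global efficiency caused by its removal, and breaks ties arbitrarily. All function names are hypothetical.

```python
from collections import deque


def efficiency(adj):
    """Average of 1/d_ij over ordered pairs i != j (0 for unreachable pairs)."""
    n = len(adj)
    total = 0.0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))


def information_centrality(adj):
    """Relative efficiency drop (E - E') / E for every edge (u, v), u < v."""
    base = efficiency(adj)
    scores = {}
    for u in adj:
        for v in list(adj[u]):
            if u < v:
                adj[u].remove(v)
                adj[v].remove(u)
                scores[(u, v)] = (base - efficiency(adj)) / base
                adj[u].add(v)
                adj[v].add(u)
    return scores


def count_components(adj):
    """Number of connected components (isolated nodes count as components)."""
    seen, count = set(), 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return count


def remove_edges_by_information(edges):
    """Remove the most central edge, recompute, repeat until no edge is left.

    Returns a list of (removed_edge, number_of_components_afterwards)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    history = []
    while any(adj.values()):
        scores = information_centrality(adj)  # recalculated at every step
        u, v = max(scores, key=scores.get)    # ties: first edge encountered
        adj[u].remove(v)
        adj[v].remove(u)
        history.append(((u, v), count_components(adj)))
    return history
```

On two triangles joined by a single edge, the first entry of the returned history is the joining edge, after which the graph has two components. The sequence of component counts is what a dendrogram of the splits would be built from.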

For the quantitative evaluation of the goodness of the
successive splits, which is necessary in order to identify
the best subdivision of the network, we adopted the
modularity Q introduced in [18]. Our algorithm runs to
completion in time $O(K^3 N)$ (K and N are the number of
edges and vertices of the graph, respectively) and therefore
is not as fast as other methods; because of that, networks
with thousands of vertices are out of reach. The
aim of the paper, however, was to check whether the
information centrality is relevant in the search for
communities.
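
The quoted running time can be recovered with a rough operation count. The estimate below is a sketch that assumes an unweighted graph whose efficiency is obtained from one breadth-first search per vertex, each costing O(K) when the graph has at least as many edges as vertices; the text itself states only the total.

```latex
\begin{align*}
\text{one evaluation of } E &:\quad N \text{ breadth-first searches} \times O(K) = O(KN),\\
\text{all } C^I_k \text{ at one stage} &:\quad K \text{ evaluations of } E = O(K^2 N),\\
\text{full run } (K \text{ removals}) &:\quad K \times O(K^2 N) = O(K^3 N),\\
\text{sparse graph } (K \sim N) &:\quad O(K^3 N) = O(N^4).
\end{align*}
```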

The results of the application of our method both
to computer generated networks and to real networks
clearly show that the algorithm is indeed able to detect
the real communities in most cases. This implies the
existence of a correlation between the information centrality
$C^I_k$ of an edge k and the fact that the edge joins two
different communities; the higher $C^I_k$, the more likely k
is a tie between groups. This is confirmed by the correlation
we observed between the peaks of $C^I$ and the jumps
in the modularity (see Figs. 2 and 9). We stressed the
importance of the recalculation of the information centrality
step by step; without it the algorithm is not able
to distinguish the communities. Our method was especially
devised for sparse graphs (i.e. when $K \sim N$), and it
is probably doomed to fail for dense graphs ($K \sim N^2$).

The examples we have taken allowed us as well to see
how effective our algorithm is compared with others. In
particular we made extensive comparisons with the
algorithm of Girvan and Newman [18], which also uses a
centrality measure, the edge betweenness. It turns out
that our algorithm is generally as good as the one of
Girvan and Newman. It seems to perform slightly better
when there is a high degree of mixture between the
classes; on the other hand, it sometimes has trouble with
nodes which are too loosely bound to the rest of the network
(like nodes with a single edge), which may separate
too early and be misclassified, although they often happen
to be truly independent communities.